Learning Objectives & Strategies:
Familiarize yourself with the concept of Move 1, "Establish a Territory".
1. Read the three moves below and think about what each move does.
2. Keep the Move titles (e.g. Move 1: Establish a Territory) in mind when you read and compare.
3. Find out how Move 1 is different from Moves 2 and 3.
4. Click on "Move 1 analysis" to see a close analysis of Move 1.
5. When ready, click here to take the exercises!
Move
Title: Pitfalls in Corpus Research
Author(s): TONI RIETVELD, ROELAND VAN HOUT and MIRJAM ERNESTUS
Journal: Computers and the Humanities 38 (2004).
Move 1: Establish A Territory
The use of corpora has become common in language research over the last decades. In many branches of linguistics, corpora provide core data for survey research and for the development and testing of hypotheses. The origins of these corpora can be manifold: texts from the Middle Ages, series of samples from current newspapers, essays written by school pupils, or letters written by emigrants to those who stayed behind. Corpora of speech may just include transcripts, but rapid developments in storage capacity and computational power have made the availability of sound and video signals a reality. Research tools have been developed to make these corpora easily accessible.
Move 2: Establish A Niche
In spite of the rapid developments in corpus-based research, some basic problems with this type of research have not received the interest they deserve. Several pitfalls keep showing up, related both to the transcription and coding of corpus data, and to their analysis. In this paper, we address some of the pitfalls.
Move 3: Present the Present Work
In Section 2, we start with transcription and coding, where conflicting judgments between experts or evaluators quite often show up. The degree of conflict can be made clear by calculating agreement indices. Moreover, we will show how data on which disagreement occurs ought to be dealt with in the analysis. The statistical analysis of frequency data is the central topic of Section 3. Basically, the analysis of this type of data is fairly straightforward. The primary technique is χ² analysis, a technique explained in introductory textbooks on statistics. An important assumption of χ² analysis and equivalent statistics is the independence of observations, and precisely this assumption is problematic in corpus research. We show how two kinds of dependences may interfere in the statistical analysis, both resulting in a Type I error which is too high; that is to say, the significance of an effect is claimed too often where in fact there is no effect. Section 4 deals with two other well-known problems in χ² analysis, viz. the effects of small and large samples. Small samples tend to yield few significant effects, while the ‘high significance’ levels obtained with large samples are often incorrectly interpreted as indicators of substantial effects. For small samples the concept of power is relevant.
For large samples, we need an index which expresses the size of an effect, independently from the sample size. In Section 5, we discuss the use of the log odds ratio as an alternative to χ² analysis. Its use is still quite rare in corpus analysis, although it has outstanding statistical properties. Log odds form the basis of attractive multivariate techniques, such as logit analysis and logistic regression.
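Note for readers: the contrast that Move 3 draws between χ² significance and effect size can be illustrated with a small sketch. It is not taken from the paper; the 2×2 counts and the function names `chi_square_2x2` and `log_odds_ratio` are made up for illustration. The sketch assumes a plain Pearson χ² statistic without continuity correction and shows that multiplying every cell by ten multiplies χ² by ten, while the log odds ratio stays the same.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def log_odds_ratio(a, b, c, d):
    """Natural log of the odds ratio (a/b) / (c/d); all cells must be non-zero."""
    return math.log((a * d) / (b * c))


# Hypothetical counts: the same proportions at two sample sizes.
small = (30, 20, 20, 30)              # 100 observations
large = tuple(10 * x for x in small)  # 1000 observations

for label, table in (("small", small), ("large", large)):
    print(label,
          "chi-square =", round(chi_square_2x2(*table), 3),
          "log odds ratio =", round(log_odds_ratio(*table), 3))
```

With these invented counts, the χ² statistic grows with the sample size although the proportions are identical, which is why an index that is independent of sample size is needed to judge how large an effect is.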